Abstract
Introduction. Life expectancy is, by definition, the average number of years a person can expect to live from birth to
death. It is therefore the best indicator for assessing the health of human beings, but also a comprehensive index for
assessing the level of economic development, education and health systems. From our extensive research, we have 
found that most existing studies contain qualitative analyses of one or a few factors. There is a lack of quantitative 
analyses of multiple factors, which leads to a situation where the predominant factor influencing life expectancy cannot 
be identified with precision. However, with the existence of various conditions and complications witnessed in society 
today, several factors need to be taken into consideration to predict life expectancy. Therefore, various machine learning 
models have been developed to predict life expectancy. The aim of this article is to identify the factors that determine 
life expectancy. 

Materials and Methods. Our research uses the Pearson correlation coefficient to assess correlations between
indicators, and we use multiple linear regression, Ridge regression, and Lasso regression models to measure the
impact of each indicator on life expectancy. For model selection, the Akaike information criterion, the adjusted
coefficient of determination (R²) and the root mean square error (RMSE) were used.

Results. Based on these criteria, multiple linear regression was selected for the development of the life expectancy
prediction model, as this model obtained the smallest Akaike information criterion of 6109.07, an adjusted coefficient
of determination (R²) of 85 % and an RMSE of 3.85.

Conclusion and Discussion. At the end of our study, we concluded that the variables that best explain life expectancy
are adult mortality, infant mortality, percentage of expenditure, measles, under-five mortality, polio, total expenditure,
diphtheria, HIV/AIDS, GDP, thinness 1-19 years, income composition of resources, and schooling.


The results of this analysis can be used by the World Health Organization and the health sectors to improve society. 
Keywords: life expectancy, machine learning, machine learning models. 


Acknowledgments. This paper has been supported by the RUDN University Strategic Academic Leadership 
Program. 

For citation. Brou Kouame Amos, I. Smirnov. Determinants Factors in Predicting Life Expectancy Using Machine
Learning. Advanced Engineering Research, 2022, vol. 22, no. 4, pp. 373-383.
https://doi.org/10.23947/2687-1653-2022-22-4-373-383


© Kouame Amos, Smirnov I., 2022 




http://vestnik-donstu.ru 




Advanced Engineering Research 2022. V. 22, no. 4. P. 373-383. ISSN 2687-1653 


Introduction 

Human life expectancy is a statistic used in demography to estimate the average age to which people in a given
region and at a given time can be expected to live under current conditions [1]. Life expectancy is not only a
statistical indicator of human health, but also a means of assessing the degree of economic, educational,
health [2] and environmental development. The World Health Organization (WHO) considers life expectancy to be a
key, if not the most important, indicator of health, reflecting the instruments of human existence [3-6]. In the
majority of the world's countries, life expectancy has in fact increased. Global life expectancy increased from
67.2 years in 2005 to 70.8 years in 2015¹. The United Nations and individual national governments now have the
optimization of human life expectancy, health and well-being as their main objective [7-8]. The UN has been a strong
promoter of human health by providing sanitary remedies, which greatly improve the urban environment and help
developing countries². In the African region of the World Health Organization, life expectancy is 61.2 years, while in
the European region it is 77.5 years, giving a ratio of about 1.3 between the two regions [9]. Analysis of the
disparities in life expectancy between developed and developing countries will enable the United Nations to improve
its health promotion and humanitarian assistance activities. It will also enable governments of different nations to
establish more effective policies to increase life expectancy and improve living standards. States would be able to
significantly increase the life expectancy of their population by investing more in the health care system [7].
According to [10], the increase in life expectancy in the United States is substantially related to the increase in
per capita income. According to some researchers, the relevant factors affecting life expectancy are mainly
environmental, social and economic. These vary according to location and involve economic development and medical
and health requirements. Existing research has given rise to debates about the factor that determines life
expectancy. In [11], the author analysed life expectancy in Tibet, China, and identified the main factors determining
it. The author of [12] argued that the social economy played an important role in determining life expectancy in the
early stages of development, but that it was replaced by diet and lifestyle once economic development reached a
certain level. The article [13] considered that the determining factor in the evolution of life expectancy in Eastern
Europe was lifestyle.

In the end, several studies have identified numerous factors influencing life expectancy. However, few studies
have compared economic development with environmental factors to analyse the intensity of their impact on life
expectancy. Several studies have tried to identify the determining factor of life expectancy, but there are in fact
several. Epidemiological studies in developed countries reveal large differences in life expectancy that are often highly
complex. A recent study in the United States suggests that 10 to 38 % of the differences in life expectancy can be
explained by work-related stress. Life expectancy depends on many factors such as economic status, regional changes
in education, gender disparities, physical and mental illnesses, alcohol consumption, GDP, health care spending, and
many other demographic factors. Life expectancy has increased during the 20th and 21st centuries in
industrialized countries [14-20]. The improvement in life expectancy in Europe has been accompanied by growth in
the population aged over 50. There were 179 million people aged 50 or over in all EU Member States in 2008, and
195 million in 2014, with women accounting for about 55 % of the total [21-24].

According to some sources, the level of economic development has a significant effect on life expectancy. Indeed,
studies have shown that people who are financially well off and those from wealthy families tend to have a higher life
expectancy [25, 26]. For some researchers, other economic development variables such as GDP per capita [27, 28],
urbanisation rate [29] and level may affect life expectancy to different degrees. Some studies show that environmental
factors are determinants of life expectancy [29]. Indeed, according to [29, 30], most environmental factors, such as
ecological resilience and environmental sustainability, are positively correlated with life expectancy, while some
factors, including biodiversity, are negatively correlated with life expectancy. According to [31], current environmental
conditions influence the life expectancy of the population at birth, while cumulative changes in circumstances
continue to influence the remaining life expectancy of the population at different ages over time. J. O. Anderson
[32] argued that people living in an environment with high levels of particulate pollutants over a long period of
time had higher cardiovascular morbidity, and that there was some degree of dose dependence. Other researchers
have studied the impact of different environmental variables on life expectancy. A. Wuffle [33] compared the
average temperature of all US states. The results showed that the lower the average temperature in November, the
higher the life expectancy of the population in those states [34-36].

¹ United Nations Statistical Yearbook, 2017 edition. United Nations, New York; 2017.
² United Nations Economic and Development website: https://www.un.org/chinese/esa/health.htm (accessed: 9 February, 2021)

Therefore, through machine learning, we will determine the factors that influence life expectancy. Machine 
learning (ML) can be understood as a discipline that lies at the intersection of mathematics, statistics and computer 
science. Machine learning has played an important role in the development of artificial intelligence (AI). Thus, artificial 
intelligence, through machine learning, helps companies to prevent problems and increase profits. In the field of health, 
machine learning is still surprising researchers. It is now the most widely used tool for prediction and forecasting. 
Machine learning, which represents a cutting-edge technology due to its predictive accuracy in several problems, 
is widely used to increase life expectancy by reducing the mortality rate [17]. Indeed, given that several elements 
impact on life expectancy, the multiple regression model is of paramount importance and corresponds to the 
exploration of the specific relationship and level of impact between several factors and life expectancy. 

This paper uses multiple linear regression models and Pearson's correlation coefficient to examine the relationship
between several variables and life expectancy and to provide more help for future research. These models are
also used as a basis for suggestions to states for improving life expectancy in order to support the development of
human society.

Materials and Methods 

The World Health Organization (WHO) Global Health Observatory (GHO) data repository tracks health status and 
many other related factors for all countries. The datasets are made available to the public for analysis of health data. 
Data on life expectancy and health factors for 193 countries were collected from the same WHO website and the
corresponding economic data were collected from the UN website. From all categories of health-related factors, only the
most representative critical factors were selected. It has been observed that in the last 15 years, the health sector has 
undergone enormous development, resulting in improved human mortality rates, especially in developing countries, 
compared to the last 30 years. Therefore, in this project, we considered data from the year 2000 to 2015 for 
193 countries for further analysis. The individual data files were merged into a single dataset. Initial visual inspection of 
the data revealed some missing values. As the datasets were from WHO, we did not find any obvious errors. 

Our dataset had missing values for population, hepatitis B and GDP. The missing data were imputed using the
*mice* function in the R package of the same name.

Each variable is associated with an imputation model, conditional on the other variables in the data set: if we have
n variables $X_1, \dots, X_n$, the missing data for a variable $X_i$ are replaced by the predictions of a model built
from the other variables.
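The idea of replacing missing values with predictions from the other variables can be illustrated with a much simpler scheme than the one used in the study. The sketch below is not the *mice* algorithm (which cycles over all incomplete variables and draws several imputed data sets); it is a single deterministic regression imputation of one variable from one predictor, written in Python with hypothetical toy values and hypothetical function names.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def impute_from_predictor(target, predictor):
    """Fill None entries of `target` with predictions based on `predictor`."""
    # Fit the imputation model on the rows where the target is observed.
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Keep observed values, replace missing ones by the model prediction.
    return [a * x + b if y is None else y for x, y in zip(predictor, target)]


# Hypothetical toy values, not taken from the WHO data.
schooling = [8.0, 10.0, 12.0, 14.0, 16.0]
gdp = [1000.0, None, 3000.0, 4000.0, None]

gdp_filled = impute_from_predictor(gdp, schooling)
print(gdp_filled)
```

A chained-equations procedure would repeat this kind of step for every incomplete variable, using all other variables as predictors, and iterate until the imputed values stabilise.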

The final file is composed of 22 attributes: the target variable “life expectancy” and various other factors, including
social factors such as total expenditure, population and education, and health factors such as BMI and measles. These
data are available on Kaggle [18]. All predictors were then divided into several broad categories: vaccination-related
factors, mortality factors, economic factors, and social factors.

Data mining

The objective of this section is to gain a better understanding of the data by extracting information from the data. 
We mainly want to determine the relationship between the variables. 

The correlation matrix, visualized using a heat map (as shown in Figure 1), is one of the best ways to understand the 
correlation between variables. It is plotted using the R library “Reshape2” and shows us the strength of the linear 
relationships between the variables. The linear relationship between the outcome and the characteristics can be 
estimated by a correlation matrix. In multivariate analysis, it plays an important role, as it elaborates the relationship 
between the different components [19]. Looking at Figure 1, we can see that: 

The target variable Life expectancy is strongly correlated (positively or negatively) with:

— Adult mortality (-0.70); 

— HIV/AIDS (-0.56); 

— Composition of income resources (0.72); 

— Schooling (0.75). 

There is a very low correlation, or none at all, between the target variable Life expectancy and Population (-0.02).

The infant deaths variable is almost perfectly positively correlated with under-five deaths (1.00).

The GDP variable and the percentage of expenditure are strongly positively correlated (0.90).

The Hepatitis B variable is moderately positively correlated with Polio (0.49) and Diphtheria (0.61).

The variables diphtheria and polio are strongly positively correlated (0.67).

The HIV/AIDS variable is negatively correlated with the income composition of resources (-0.25).

The thinness 10-19 years variable is very strongly positively correlated with the thinness 5-9 years variable (0.94).

The variable Schooling and the income composition of resources are very strongly correlated (0.8).
Fig. 1. Visualisation of the correlation matrix by heat map (the authors' figure)
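As an illustration of how such a correlation matrix is obtained, the following Python sketch computes pairwise Pearson coefficients for a few columns. The study itself used R; the column names and values here are hypothetical toy data, and the helper names `pearson` and `correlation_matrix` are chosen only for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values, not taken from the WHO data.
data = {
    "life_expectancy": [52.0, 58.0, 65.0, 71.0, 79.0],
    "adult_mortality": [410.0, 330.0, 240.0, 170.0, 80.0],
    "schooling": [6.0, 8.5, 10.0, 12.5, 15.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in matrix])
```

A heat map is then simply a colour-coded rendering of this matrix, with one cell per pair of variables.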
By examining the correlation coefficients in Figure 1, we detect potential predictors of life expectancy. For each
numerical variable that is potentially predictive of life expectancy, we run a simple linear regression between it and
the life expectancy variable and display the Pearson correlation coefficient and its p-value. The simple linear
regression model has the form y = ax + b, where y is the dependent variable and x is the independent variable; a and b
are the model parameters (a is the slope of the fitted line and b is the intercept).
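A minimal Python sketch of this step is given below, under the assumption that ordinary least squares is used to fit y = ax + b. It returns the slope, the intercept, the Pearson coefficient r and the statistic t = r·sqrt((n − 2)/(1 − r²)); the p-value would then be read from a Student t distribution with n − 2 degrees of freedom, which is left out here because the Python standard library provides no t distribution. The data are hypothetical.

```python
from math import sqrt


def simple_linear_regression(xs, ys):
    """Return slope a, intercept b, Pearson r and the t statistic of r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx                      # slope of the fitted line
    b = my - a * mx                    # intercept
    r = sxy / sqrt(sxx * syy)          # Pearson correlation coefficient
    # Test statistic for H0: no linear relationship (undefined if |r| = 1).
    t = r * sqrt((n - 2) / (1.0 - r * r))
    return a, b, r, t


# Hypothetical toy values: an HIV/AIDS indicator against life expectancy.
hiv = [0.1, 0.5, 1.2, 3.0, 6.5, 10.0]
life = [75.0, 72.0, 68.0, 61.0, 55.0, 48.0]

a, b, r, t = simple_linear_regression(hiv, life)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, t = {t:.3f}")
```

A negative slope together with a large absolute value of t corresponds to the situation described for adult mortality and HIV/AIDS in this section.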

From the result of this exploratory analysis on our data, we concluded that adult mortality, HIV/AIDS, BMI, income 
composition, and education are the most important factors in predicting life expectancy. This selection was made on the 
basis of the Pearson correlation coefficient and p-value (as shown in Figures 2, 3, 4, 5, 6). 

— The correlations between life expectancy and the variables infant deaths, GDP, alcohol, percentage of expenditure,
hepatitis B, measles, under-five deaths, polio, total expenditure, diphtheria, population, thinness 1-19 years and
thinness 5-9 years are statistically significant, as their p-values are less than 0.001, but the linear relationships
are weak, with Pearson correlation coefficients below 0.5 in absolute value. Under these conditions, the variables
infant deaths, GDP, alcohol, percentage of expenditure, hepatitis B, measles, under-five deaths, polio, total
expenditure, diphtheria, population, thinness 1-19 years and thinness 5-9 years cannot be considered predictors of
life expectancy.

— There is a strong negative correlation between the variables Adult Mortality and Life Expectancy, with a Pearson
correlation coefficient of -0.7, which is statistically significant since the p-value is less than 0.001. In other
words, as adult mortality increases, life expectancy decreases. Under these conditions, the adult mortality variable
can be considered a predictor of life expectancy.

— There is a negative correlation between the variables HIV/AIDS and life expectancy, with a Pearson correlation
coefficient of -0.56, which is statistically significant since the p-value is less than 0.001. As the number of people
affected by HIV/AIDS increases, life expectancy decreases. Under these conditions, the HIV/AIDS variable can be
considered a predictor of life expectancy.

— There is a positive correlation between the BMI and life expectancy variables, with a Pearson correlation
coefficient of 0.56, which is statistically significant since the p-value is less than 0.001. Under these conditions,
the BMI variable can be considered a predictor of life expectancy.

— There is a strong positive correlation between the variables Income composition of resources and life expectancy,
with a Pearson correlation coefficient of 0.69, which is statistically significant since the p-value is less than
0.001. The graph shows that as the composition of income increases, life expectancy increases. Under these conditions,
the variable Income composition of resources can be considered a predictor of life expectancy.

— There is a strong positive correlation between the variables Education and Life Expectancy, with a Pearson
correlation coefficient of 0.72, which is statistically significant since the p-value is less than 0.001. The graph
shows that the higher the education, the higher the life expectancy. Under these conditions, the education variable
can be considered a predictor of life expectancy.


[Figure 2 panel title: Pearson's coefficient of correlation is -0.56 with p-value 7.67e-238; vertical axis: life expectancy]
Fig. 2. Correlation between life expectancy and the HIV/AIDS variable (the authors' figure)

[Figure 3 panel title: Pearson's coefficient of correlation is -0.7 with p-value 0.0]
Fig. 3. Correlation between life expectancy and the adult mortality variable (the authors' figure)

[Figure 4 panel title: Pearson's coefficient of correlation is 0.69 with p-value 0.0]
Fig. 4. Correlation between life expectancy and the income composition variable (the authors' figure)

[Figure 5 panel title: Pearson's coefficient of correlation is 0.56 with p-value 6.85e-244]
Fig. 5. Correlation between life expectancy and the BMI variable (the authors' figure)

[Figure 6 panel title: Pearson's coefficient of correlation is 0.72 with p-value 0.0]
Fig. 6. Correlation between life expectancy and the education variable (the authors' figure)


Based on the results of the Pearson correlation, we detect the variables that have an influence on life expectancy.
However, the Pearson correlation is not sufficient to determine the predictors of life expectancy. For this purpose, we
will run several regression models and select the one with the smallest AIC, the highest adjusted R² and the smallest
root mean square error (RMSE).

Methodology

In order to determine the variables that predict life expectancy, different regression models are used, namely,
multiple linear regression, Ridge regression and Lasso regression. We will then examine the criteria for selecting
(p - 1) explanatory variables from the k available explanatory variables. These criteria are: Mallows' Cp criterion,
the coefficient of determination R², the Bayesian information criterion (BIC) and the Akaike information criterion (AIC).


Multiple linear regression 
Multiple linear regression is an immediate generalization of simple linear regression. In multiple linear regression,
the function F that we want to estimate no longer depends on a single variable, but on several. If we have n pairs of the
form $(X_i = (x_{i1}, x_{i2}, \dots, x_{im}) \in \mathbb{R}^m,\ y_i \in \mathbb{R})$, with $y_i$ the result obtained for
the observation $X_i = (x_{i1}, x_{i2}, \dots, x_{im})$, then the function we wish to estimate will be of the general
form below:

$$F(X_i) = F(x_{i1}, x_{i2}, \dots, x_{im}) = a_1 x_{i1} + a_2 x_{i2} + \dots + a_m x_{im} + a_0. \tag{1}$$

The objective is to estimate the vector $A = (a_1, a_2, \dots, a_m, a_0)$ so that the function F is as close as possible to $y_i$.


As with simple linear regression, the least squares method can be used to find the vector A, and the function to be
minimized will be defined as follows:

$$E = \sum_{i=1}^{n} \left[ y_i - F(x_{i1}, x_{i2}, \dots, x_{im}) \right]^2. \tag{2}$$

One of the most difficult aspects of many regression algorithms is to determine how to converge to the
configurations $(a_1, a_2, \dots, a_m, a_0)$ that yield small errors $\varepsilon_i = \hat{y}_i - y_i$, where
$\hat{y}_i = F(X_i)$, while avoiding the trap of overfitting.
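To make the minimisation of E concrete, the sketch below solves the normal equations (XᵀX)A = Xᵀy in plain Python with Gaussian elimination. This is one standard way to obtain the least-squares vector A, shown under the assumption that the predictors are not collinear; it is not necessarily the routine used by the authors. The data and helper names are hypothetical, and the toy responses were generated from y = 50 + 2·x₁ − 0.05·x₂ so that the recovered coefficients can be checked.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_multiple_regression(X, y):
    """Return [a_0, a_1, ..., a_m] minimising E = sum_i (y_i - F(X_i))^2."""
    rows = [[1.0] + list(x) for x in X]  # the leading 1 carries the intercept a_0
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coeffs, x):
    """Evaluate F(x) = a_0 + a_1*x_1 + ... + a_m*x_m."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))


def squared_error(coeffs, X, y):
    """The error function E of equation (2)."""
    return sum((yi - predict(coeffs, x)) ** 2 for x, yi in zip(X, y))


# Hypothetical toy data: (schooling, adult mortality) -> life expectancy.
X = [(6, 400), (8, 150), (10, 300), (12, 100), (14, 350), (9, 200)]
y = [42.0, 58.5, 55.0, 69.0, 60.5, 58.0]

coeffs = fit_multiple_regression(X, y)
print([round(c, 4) for c in coeffs], squared_error(coeffs, X, y))
```

With real data the residual error is not zero, and the fitted coefficients indicate the direction and size of each variable's contribution when the others are held fixed.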


The best-known approach to minimizing the error function E while avoiding overfitting is the
introduction of the concept of regularization.

There are two regularizations widely used with regression models: the Lasso regularization and the Ridge 
regularization. 

Lasso and Ridge regression 

Lasso regression is a regression model in which the selection and regularization of variables take place simultaneously.
This method uses a penalty that affects the value of the regression coefficients. With Lasso regularization, the error
function to be minimized becomes:

$$E_{lasso} = \sum_{i=1}^{n} \left[ y_i - F(x_{i1}, x_{i2}, \dots, x_{im}) \right]^2 + \lambda \sum_{j=0}^{m} |a_j|. \tag{3}$$

The difference between $E$ and $E_{lasso}$ is that in $E_{lasso}$ we have added the term $\lambda \sum_{j=0}^{m} |a_j|$
to further penalise solutions with large values of $(a_1, a_2, \dots, a_m, a_0)$.


Ridge regression is a regularised regression algorithm that performs an L2 regularisation by adding an L2 penalty,
which is equal to the sum of the squares of the coefficients. With Ridge regularization, large values of
$(a_1, a_2, \dots, a_m, a_0)$ are penalised more heavily, and the error function to be minimized becomes:

$$E_{ridge} = \sum_{i=1}^{n} \left[ y_i - F(x_{i1}, x_{i2}, \dots, x_{im}) \right]^2 + \lambda \sum_{j=0}^{m} a_j^2. \tag{4}$$

We note that in both the Lasso and Ridge regularization cases, when $\lambda$ is set to 0, then $E = E_{lasso} = E_{ridge}$.
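The three error functions can be compared directly in a few lines of Python. The sketch below follows the formulas as written in this section, so the penalty also covers the intercept a_0 (many software implementations leave the intercept unpenalised). The coefficient vector, data and function names are hypothetical, and `lam` stands for the regularization parameter λ.

```python
def predict(coeffs, x):
    """Evaluate F(x) = a_0 + a_1*x_1 + ... + a_m*x_m."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))


def squared_error(coeffs, X, y):
    """Unpenalised error E."""
    return sum((yi - predict(coeffs, x)) ** 2 for x, yi in zip(X, y))


def lasso_loss(coeffs, X, y, lam):
    """E plus the L1 penalty: lam times the sum of |a_j|."""
    return squared_error(coeffs, X, y) + lam * sum(abs(a) for a in coeffs)


def ridge_loss(coeffs, X, y, lam):
    """E plus the L2 penalty: lam times the sum of a_j squared."""
    return squared_error(coeffs, X, y) + lam * sum(a * a for a in coeffs)


# Hypothetical coefficients [a_0, a_1, a_2] and toy observations.
coeffs = [1.0, 2.0, -3.0]
X = [(1.0, 1.0), (2.0, 0.0)]
y = [0.5, 5.5]

e_plain = squared_error(coeffs, X, y)
e_lasso = lasso_loss(coeffs, X, y, lam=0.1)
e_ridge = ridge_loss(coeffs, X, y, lam=0.1)
print(e_plain, e_lasso, e_ridge)
```

Because the Ridge penalty grows with the square of each coefficient, a large coefficient such as -3 contributes far more to it than to the Lasso penalty, which is why Ridge shrinks large coefficients strongly while Lasso tends to set small ones exactly to zero.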

Mallows' Cp criterion 

Mallows' Cp is a selection criterion between several regression models. It compares the accuracy and bias of the full
model with those of models containing a subset of predictors. Mallows' Cp criterion is defined by the following
formula:

$$C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - (n - 2p), \tag{5}$$

where $\mathrm{SSE}_p$ is the residual sum of squares of the model with p parameters, n is the number of observations
and $\hat{\sigma}^2$ is an estimate of the error variance $\sigma^2$.

The problem is that we cannot estimate $\sigma^2$ by $s^2 = \mathrm{SSE}_p / (n - p)$, because $C_p$ would then always
be equal to p and the criterion would no longer be informative [9]. In practice, we therefore estimate $\sigma^2$ by
the $s^2$ of the model that involves all k available explanatory variables, then we choose among the models the one for
which Mallows' Cp criterion is closest to p.
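The remark about the variance estimate can be checked numerically. The short Python sketch below evaluates the usual form of the criterion, SSE_p/σ̂² − (n − 2p), first with the sub-model's own variance estimate and then with a variance estimate taken from a full model. All numbers are hypothetical and the function name is chosen for this example.

```python
def mallows_cp(sse_p, sigma2_hat, n, p):
    """Mallows' Cp for a sub-model with p parameters and residual sum sse_p."""
    return sse_p / sigma2_hat - (n - 2 * p)


n, p = 100, 5            # hypothetical sample size and number of parameters
sse_p = 380.0            # hypothetical residual sum of squares of the sub-model

# Using the sub-model's own estimate s^2 = SSE_p / (n - p) always returns p.
s2_own = sse_p / (n - p)
cp_own = mallows_cp(sse_p, s2_own, n, p)

# Using the estimate from the full model (hypothetical value) is informative.
s2_full = 3.8
cp_full = mallows_cp(sse_p, s2_full, n, p)

print(cp_own, cp_full)
```

In this toy case the second value is well above p, which would indicate that the sub-model leaves out useful predictors.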

The coefficient of determination R²

The R² criterion is the simplest to use. However, with the introduction of new variables, it increases monotonically
even if they are weakly correlated with the explained variable. It is therefore advisable to turn to other criteria
such as the adjusted R² criterion, Mallows' Cp, the AIC and AICc criteria, and the BIC criterion.

The adjusted coefficient of determination R² is a refinement of the coefficient of determination R².

The adjusted R² measures the proportion of the variance of the dependent variable that can be explained by the
independent variables, corrected for the number of predictors. On the basis of the adjusted R² value, one can judge
whether the regression equation fits the data. The higher the adjusted R², the better the regression equation, because
it implies that the chosen independent variables explain the dependent variable well.
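For reference, a minimal Python sketch of these fit measures is given below. It assumes the usual definitions R² = 1 − SS_res/SS_tot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the number of predictors, together with the root mean square error used later for model comparison. The observed and predicted values are hypothetical.

```python
from math import sqrt


def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, n_predictors):
    """Adjusted R^2, which penalises the number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.9, 4.9]

r2 = r_squared(y_true, y_pred)
r2_adj = adjusted_r_squared(r2, n=len(y_true), n_predictors=1)
error = rmse(y_true, y_pred)
print(round(r2, 4), round(r2_adj, 4), round(error, 4))
```

Unlike R², the adjusted value decreases when a predictor is added that does not reduce the residual sum of squares enough to offset the lost degree of freedom.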

The Bayesian information criterion BIC 

The Bayesian Information Criterion (BIC) is derived from the Akaike Information Criterion (AIC) and is defined
by: $\mathrm{BIC} = -2\log(L) + k\log(n)$. It is more parsimonious than the AIC criterion because it penalises the
number of variables present in the model more heavily. According to [9], the AIC was introduced to retain the variables
relevant to the forecast, whereas the BIC criterion aims at selecting the statistically significant variables in the model.

The Akaike Information Criterion (AIC)

The Akaike Information Criterion (AIC) is a mathematical method that is applied to models estimated by a
maximum likelihood method to assess how well a model fits the data from which it was generated. AIC is applied to
analysis of variance, multiple linear regression, logistic regression and Poisson regression. The AIC criterion is defined
by:

$$\mathrm{AIC} = 2k - 2\log(L), \tag{6}$$

where L is the maximised likelihood and k is the number of model parameters. With this criterion, the deviance of the
model, $-2\log(L)$, is penalised by two times the number of parameters.

Therefore, the AIC represents a compromise between bias, which decreases with the number of parameters, and 
parsimony, the desire to describe the data with as few parameters as possible. 

Strictly speaking, all models compared should derive from the same 'complete' model, which should itself be included
in the list of models compared.

The best model is the one with the lowest AIC. When the number of parameters k is large compared to the number
of observations n, i.e., if n/k < 40, it is recommended to use the corrected AIC. The corrected Akaike information
criterion, AICc, is defined by:

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}. \tag{7}$$
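The information criteria defined in this section can be computed from the residual sum of squares of a fitted linear model. The Python sketch below assumes Gaussian errors, in which case the maximised log-likelihood is −(n/2)(log 2π + log(RSS/n) + 1); whether the error variance is counted among the k parameters differs between software packages, so k is simply passed in by the caller. The numerical inputs are hypothetical and are not the values reported in this study.

```python
from math import log, pi


def gaussian_log_likelihood(rss, n):
    """Maximised log-likelihood of a linear model with Gaussian errors."""
    return -0.5 * n * (log(2.0 * pi) + log(rss / n) + 1.0)


def aic(log_l, k):
    """Akaike information criterion, equation (6)."""
    return 2.0 * k - 2.0 * log_l


def aicc(log_l, k, n):
    """Corrected AIC, equation (7); requires n > k + 1."""
    return aic(log_l, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(log_l, k, n):
    """Bayesian information criterion as defined in the BIC subsection."""
    return -2.0 * log_l + k * log(n)


# Hypothetical inputs: 200 observations, 14 parameters, RSS of 2964.5.
n, k, rss = 200, 14, 2964.5
log_l = gaussian_log_likelihood(rss, n)

aic_value = aic(log_l, k)
aicc_value = aicc(log_l, k, n)
bic_value = bic(log_l, k, n)
print(round(aic_value, 2), round(aicc_value, 2), round(bic_value, 2))
```

Since n/k is below 40 in this toy case, the corrected criterion would be the one to report; BIC exceeds AIC as soon as log(n) is greater than 2, which reflects its heavier penalty on the number of parameters.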

Results 

Our analysis has shown that life expectancy increases over the years, and that it is on average higher in developed 
countries than in developing countries. 

This study has also led us to the conclusion that the model chosen for the selection of life expectancy predictors is
multiple linear regression (Table 1), as this model obtained the lowest Akaike information criterion of 6109.07, an
adjusted R² of 85 % and an RMSE of 3.85.

These measures were better than those of the Lasso and Ridge regression models. According to this model, and given its
p-value of less than 2.2e-16, all variables are significant except for: alcohol, hepatitis B, measles, population,
thinness 1-19 years and thinness 5-9 years. This means that we can do without these variables to explain life
expectancy. However, applying the Akaike information criterion to the multiple linear regression model, the variables
that best explain life expectancy are: adult mortality, infant deaths, percentage of expenditure, measles, under-five
deaths, polio, total expenditure, diphtheria, HIV/AIDS, GDP, thinness 1-19 years, income composition, and schooling.


Table 1
Models                        Adjusted R²    RMSE
Multiple Linear Regression    0.85           3.85
Lasso Regression              0.82           3.85
Ridge Regression              0.82           3.91


Conclusion 

Before analysing this data set, we had the impression that life expectancy could be increased if we had more money.
This is because it takes money to be healthy and to receive appropriate medical treatment. Moreover, if a country is
economically developed (high GDP), its citizens are more likely to be able to afford appropriate medical treatment.
This would mean that life expectancy depends largely on economic factors. However, after analysing this dataset, we
have concluded that life expectancy is mainly affected by adult mortality, infant mortality, percentage of expenditure,
measles, under-five mortality, polio, total expenditure, diphtheria, HIV/AIDS, GDP, thinness 1-19 years, income
composition, and schooling. The role of schooling makes sense because if a person is educated enough to recognize
health problems, they can make appropriate lifestyle changes, including but not limited to diet and exercise, which
would ideally extend their life expectancy. Education can change a person's perception and help them understand the
benefits of being fit and its impact on health. In addition, a higher level of education could be linked to a higher
income, and a higher income would mean higher spending on health and fitness. Thus, education is directly or
indirectly a good predictor of life expectancy. Various machine learning models have been used for training. Among
these models, the multiple linear regression model has proven to be very effective in terms of both the coefficient of
determination and the errors. This model can therefore be used for the prediction of life expectancy.